Fetch model results from HF #43
base: main
Conversation
This is awesome! You're doing god's work here :D Thanks
I think adding something invalid for the
This looks really good! We should probably do this as a one-time thing only (don't support it going forward).
"evaluation_time": 0, | ||
"mteb_version": "0.0.0", |
Hmm maybe we should make it clear that this is NA
(potentially by wrapping the original object to allow for None)
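A minimal sketch of what such a wrapper could look like; ExternalResultMeta and its fields are hypothetical names for illustration, not MTEB's actual TaskResult API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container (not MTEB's real TaskResult): metadata that external
# results may lack is typed as Optional and defaults to None instead of a fake value.
@dataclass
class ExternalResultMeta:
    mteb_version: Optional[str] = None       # unknown for results scraped from model cards
    evaluation_time: Optional[float] = None  # seconds; None when not reported
```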
You mean that we should fetch results only one time for each model and not check them for updates?
You mean that we should remove from
Hey @Samoed wow thanks for working on this!
Re:
Let's do that for sure.
I think this could be a great main option going forward, instead of PRs to the results repo (i.e. both are accepted, but the MTEB folder is preferred). People's results should be generated with MTEB version and eval time (see above). For time to be meaningful, we might also want to track hardware info, right? Every time we scan for the MTEB folders, we can validate the results objects as well. I'm all for automating as much as we can :P
Slightly related to the eval time topic, maybe we could start enforcing CO₂ tracking via codecarbon. This will add one new dependency in MTEB, but I think the benefits are worth it.
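A rough sketch of what per-run CO₂ tracking with codecarbon could look like; run_evaluation is a placeholder for the actual MTEB call, not real API:

```python
from codecarbon import EmissionsTracker

def run_evaluation() -> None:
    """Placeholder for the actual MTEB evaluation call."""
    ...

tracker = EmissionsTracker()  # measures energy use and estimates emissions for the run
tracker.start()
try:
    run_evaluation()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
print(f"Estimated emissions: {emissions_kg} kg CO2eq")
```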
+1 for enabling code-carbon by default. Re: -1 for values: this would lead to issues when e.g. taking a mean. I would use math.nan instead (it is a float and an invalid value). It sounds like for now:
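To make the aggregation concern concrete (with made-up scores):

```python
import math

scores = [0.82, 0.79, -1]           # -1 as a sentinel silently drags the mean down
print(sum(scores) / len(scores))    # 0.203..., which looks like a real, very low score

scores = [0.82, 0.79, math.nan]     # NaN propagates instead of posing as a score
print(sum(scores) / len(scores))    # nan, so the missing value is obvious
```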
For now, what should I change, except
@KennethEnevoldsen in this PR embeddings-benchmark/mteb#1398, -1 was used in favor of NaN. To be consistent, should we keep the -1 approach? Unless some metrics aren't between [0, 1].
That PR is more about adding a warning to existing code (I also think it is a bad idea in that case). I would definitely change it to NA here.
@Samoed if we add the NA here (runtime, version) I believe this is ready to merge. It is one of the few missing requirements for the new release of the leaderboard 2.0.
@KennethEnevoldsen I've updated the results.
Hmm, I believe NaN is not valid JSON, shouldn't it be "null"? If it can be read I guess it doesn't matter. For the MTEB version I would probably change it to:
Yes, it’s a bit invalid as JSON, but Python can read it correctly.
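For context, Python's standard json module writes and reads the non-standard NaN literal by default, which is why it round-trips within Python, while strict JSON parsers expect null (values below are illustrative):

```python
import json
import math

# json.dumps emits the non-standard literal NaN unless allow_nan=False
print(json.dumps({"evaluation_time": math.nan}))   # {"evaluation_time": NaN}

# json.loads accepts it back as float('nan'), so round-tripping within Python works
data = json.loads('{"evaluation_time": NaN}')
print(math.isnan(data["evaluation_time"]))         # True

# null is the spec-compliant alternative and maps to None in Python
print(json.dumps({"evaluation_time": None}))       # {"evaluation_time": null}
```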
It also seems the loader requires the version to be a parsable string.
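Assuming the loader parses that field with packaging.version (an assumption about MTEB internals, not confirmed in this thread), a placeholder like "NA" would fail while a real version string works:

```python
from packaging.version import InvalidVersion, Version

print(Version("1.19.0"))   # a PEP 440 version string parses fine

try:
    Version("NA")          # a non-version placeholder does not
except InvalidVersion as err:
    print(f"Not parsable: {err}")
```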
I have added a PR which should allow the fix.
@Samoed feel free to just merge it in.
@KennethEnevoldsen I don't have permission to merge in this repository. Also, should we keep
Either is fine - I just wanted to make sure that the JSON was generally valid.
@Samoed I think the update is in on main - updating to the latest version would allow us to get this merged in.
Oh god, the problem seems to be that if the mteb_version is None, then TaskResult will try to load the results as if they are from before version
Oooor (and I saw you did this at one point, @Samoed), we can say something like
@x-tabdeveloping I think that the issue with loading was fixed in embeddings-benchmark/mteb#1453.
What's up with the test failing then?
It was not updated after the PR merge. I tried with the new mteb version, but found a small issue.
I've implemented a base version of retrieving results from the model card. This PR is a work in progress and is open for discussion. Currently, I've parsed the results for intfloat/multilingual-e5-large as a demo. During integration with other results, I ran into some issues, particularly with loading results using TaskResult. External results don't have evaluation_time and mteb_version, which TaskResult requires.
@KennethEnevoldsen @x-tabdeveloping
Ref: embeddings-benchmark/mteb#1405, embeddings-benchmark/mteb#1373
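A rough sketch of the kind of model-card fetching this PR describes, using huggingface_hub's ModelCard API; the filtering is illustrative and not the PR's actual code:

```python
from huggingface_hub import ModelCard

# Load the card for the demo model and read the eval results declared in its metadata.
card = ModelCard.load("intfloat/multilingual-e5-large")

for result in card.data.eval_results or []:
    # MTEB entries on the card carry the dataset name, metric type, and score.
    if result.dataset_name.startswith("MTEB"):
        print(result.dataset_name, result.metric_type, result.metric_value)
```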